Method for detection and tracking of deformable objects

ABSTRACT

A method for detecting and tracking a deformable object having a sequentially changing behavior, comprising: developing a temporal statistical shape model of the oscillatory behavior of the embedding function representing the object from prior motion; and then applying the model against future, sequential motion of the object in the presence of unwanted phenomena by maximizing the probability that the developed statistical shape model matches the sequential motion of the object in the presence of unwanted phenomena.

CROSS REFERENCE TO RELATED APPLICATION

This application claims priority from U.S. Provisional application No.60/705,061 filed Aug. 3, 2005, which is incorporated herein byreference.

TECHNICAL FIELD

This invention relates generally to object detection and moreparticularly to the detection and tracking of deformable objects.

BACKGROUND AND SUMMARY

As is known in the art, it is frequently desirable to detect and segmentan object from a background of other objects and/or from a background ofnoise. One application, for example, is in MRI where it is desired tosegment an anatomical feature of a human patient, such as, for example,a vertebra of the patent. In other cases it would be desirable tosegment a moving, deformable anatomical feature such as the heart.

In 1988, Osher and Sethian, in a paper entitled “Fronts propagation withcurvature dependent speed: Algorithms based on Hamilton-Jacobiformulations” J. of Comp. Phys., 79:12-49, 1988, introduced the levelset method, it being noted that a precursor of the level set method wasproposed by Dervieux and Thomasset in a paper entitled “A finite elementmethod for the simulation of Raleigh-Taylor instability”. Springer Lect.Notes in Math., 771:145-158, 1979, as a means to implicitly propagatehypersurfaces C(t) in a domain Ω⊂R^(n) by evolving an appropriateembedding function φ: Ω×[0,T]→R, where:C(t)={xεΩ|φ(x,t)=0}.  (1)

More generally, an embedding function is a real-valued height function\phi(x) defined at each point x of the image plane, such that thecontour C corresponds to all points x in the plane where \phi(x)=0:C={x|\phi(x)=0}

This is a way to represent a contour C implicitly. Rather than workingwith a contour C (moving the contour etc), one works with the function\phi. Moving the values of \phi will implicitly move the “embedded”contour. This is why \phi(x) is called an “embedding function”—it embedsthe contour as its zero-level or isoline to value 0.

The ordinary differential equation propagating explicit boundary pointsis thus replaced by a partial differential equation modeling theevolution of a higher-dimensional embedding function. The key advantagesof this approach are well-known: Firstly, the implicit boundaryrepresentation does not depend on a specific parameterization, duringthe propagation no control point regridding mechanisms need to beintroduced. Secondly, evolving the embedding function allows toelegantly model topological changes such as splitting and merging of theembedded boundary. In the context of shape modeling and statisticallearning of shapes, the latter property allows to construct shapedissimilarity measures defined on the embedding functions which canhandle shapes of varying topology. Thirdly, the implicit representation,equation (1), naturally generalizes to hypersurfaces in three or moredimensions. To impose a unique correspondence between a contour and itsembedding function one can constrain φ to be a signed distance function,i.e. |∇φ|=1 almost everywhere.

The first applications of the level set method to image segmentationwere pioneered in the early 90's by Malladi et al. in a paper entitled“A finite element method for the simulation of Raleigh-Taylorinstability: Springer Lect. Notes in Math., 771:145-158, 1979, byCaselles et al. in a paper entitled “Geodesic active contour.” In Proc.IEEE Intl. Conf. on Comp. Vis., pages 694-699, Boston, USA, 1995, byKichenassamy et al. in a paper entitled “Gradient flows and geometricactive contour models”: In IEEE Intl. Conf. on Comp. Vis., pages810-815, 1995 and by Paragios and Deriche in a paper entitled “Geodesicactive regions and level set methods for supervised texturesegmentation”: Int. J. of Computer Vision, 46(3): 223-247, 2002. Levelset implementations of the Mumford-Shah functional, see paper entitled:“Optimal approximations by piecewise smooth functions and associatedvariational problems”: Comm. Pure Appl. Math., 42:577-685, 1989 [14]were independently proposed by Chan and Vese, see paper entitled “Activecontours without edges”: IEEE Trans. Image Processing, 10(2):266-277,2001 and by Tsai et al. in a paper entitled “Model-based curve evolutiontechnique for image segmentation”: In Comp. Vision Patt. Recog., pages463-468, Kauai, Hawaii, 2001.

In recent years, researchers have proposed to introduce statisticalshape knowledge into level set based segmentation methods in order tocope with insufficient low-level information. While these priors wereshown to drastically improve the segmentation of familiar objects, sofar the focus has been on statistical shape priors (i.e., what are“priors”) which are static in time. Yet, in the context of trackingdeformable objects, it is clear that certain silhouettes (such as thoseof the beating human in an MRI application, or of a walking person inanother application) may become more or less likely over time. Leventonet al. in a paper entitled “Geometry and prior-based segmentation: In T.Pajdla and V. Hlavac, editors, European Conf. on Computer Vision, volume3024 of LNCS, pages 50-61, Prague, 2004. Springer, proposed to model theembedding function by principal component analysis (PCA) of a set oftraining shapes and to add appropriate driving terms to the level setevolution equation, Tsai et al. in a paper entitled “Curve evolutionimplementation of the Mumford-Shah functional for image segmentation,de-noising, interpolation, and magnification” IEEE Trans. on ImageProcessing, 10(8): 1169-1186, 2001 suggested performing optimizationdirectly within the subspace of the first few eigenmodes. Rousson et al.see “Shape priors for level set representations”: In A. Heyden et al.,editors, Proc. of the Europ. Conf. on Comp. Vis., volume 2351 of LNCS,pages 78-92, Copenhagen, May 2002. Springer, Berlin. and “Implicitactive shape models for 3d segmentation in MRI imaging”: In MICCAI,pages 209-216, 2004 suggested introduction of shape information on thevariational level, while Chen et al., see “Using shape priors ingeometric active contours in a variational framework”: Int. J. ofComputer Vision, 50(3):315-328, 2002 imposed shape constraints directlyon the contour given by the zero level of the embedding function. Morerecently, Riklin-Raviv et al. see European Conf. on Computer Vision,volume 3024 of LNCS, pages 50-61, Prague, 2004. Springer, proposed tointroduce projective invariance by slicing the signed distance functionat various angles.

In the above works, statistically learned shape information was shown tocope for missing or misleading information in the input images due tonoise, clutter and occlusion. The shape priors were developed to segmentobjects of familiar shape in a given image. However, although they canbe applied to tracking objects in image sequences see [Cremers et al.,“Nonlinear shape statistics in Mumford-Shah based segmentation”: In A.Heyden et al., editors, Europ. Conf. on Comp. Vis., volume 2351 of LNCS,pages 93-108, Copenhagen, May 2002. Springer], [Moelich and Chan,“Tracking objects with the Chan-Vese algorithm”, Technical Report 03-14,Computational Applied Mathematics, UCLA, Los Angeles, 2003] and [Cremerset al., “Kernel density estimation and intrinsic alignment forknowledge-driven segmentation”: Teaching level sets to walk. In PatternRecognition, volume 3175 of LNCS, pages 36-44. Springer, 2004], they arenot well suited for this task, because they neglect the temporalcoherence of silhouettes which characterizes many deforming shapes.

When tracking a three-dimensional deformable object over time, clearlynot all shapes are equally likely at a given time instance. Regularlysampled images of a walking person, for example, exhibit a typicalpattern of consecutive silhouettes. Similarly, the projections of arigid 3D object rotating at constant speed are generally not independentsamples from a statistical shape distribution. Instead, the resultingset of silhouettes can be expected to contain strong temporalcorrelations.

In accordance with the present invention, a method us provided fordetecting and tracking a deformable object having an sequentiallychanging behavior wherein the method develops a temporal statisticalshape model of the sequentially changing behavior of the embeddingfunction representing the object from prior motion and then applies themodel against future, sequential motion of the object in the presence ofunwanted phenomena by maximizing the probability that the developedstatistical shape model matches the sequential motion of the object inthe presence of unwanted phenomena.

In accordance with another feature of the invention, a method generatesa dynamical model of the time-evolution of the embedding function ofprior observations of a boundary shape of an object, such object havingan observable, sequentially changing boundary shape; and subsequentlyusing such model for probabilistic inference about such shape of theobject in the future.

The method develops temporal statistical shape models for implicitlyrepresented shapes of the object. In particular, the shape probabilityat a given time depends is a function of the shapes of the objectobserved at previous times.

In one embodiment, the dynamical shape models are integrated into asegmentation process within a Bayesian framework for level set basedimage sequence segmentation.

In one embodiment, optimization is obtained by a partial differentialequation for the level set function. The optimization includes evolutionof an interface which is driven both by the intensity information of acurrent image as well as by a prior dynamical shape which relies on thesegmentations obtained on the preceding frames.

With such method, in contrast to existing approaches to segmentationwith statistical shape priors, the resulting segmentations are not onlysimilar to previously learned shapes, but they are also consistent withthe temporal correlations estimated from sample sequences. The resultingsegmentation process can cope with large amounts of noise and occlusionbecause it exploits prior knowledge about temporal shape consistency andbecause it aggregates information from the input images over time(rather than treating each image independently).

The development of dynamical models for implicitly represented shapesand their integration into image sequence segmentation on the basis ofthe Bayesian framework draws on much prior work in various fields. Thetheory of dynamical systems and time series analysis has a longtradition in the literature (see for example [A. Papoulis, Probability,Random Variables, and Stochastic Processes, McGraw-Hill, New York,1984]). Autoregressive models were developed for explicit shaperepresentations among others by Blake, Isard and coworkers [A. Blake andM. Isard. Active Contours. Springer, London, 1998]. In these works,successful tracking results were obtained by particle filtering based onedge-information extracted from the intensity images. Here, however, themethod of the present invention, differs from these in three ways:

-   -   Here the dynamical models are for implicitly represented shapes.        As a consequence, the dynamical shape model can automatically        handle shapes of varying topology. The model trivially extends        to higher dimensions (e.g. 3D shapes), since it does not need to        deal with the combinatorial problem of determining point        correspondences and issues of control point re-gridding        associated with explicit shape representations.    -   The method according to the present invention, integrates the        intensity information of the input images in a statistical        formulation inspired by [Zhu, Yuille 1996, Chan, Vese 1999].        This leads to a region-based tracking scheme rather than an        edge-based one. The statistical formulation implies that—with        respect to the assumed intensity models—the method optimally        exploits the input information. It does not rely on a        pre-computation of heuristically defined image edge features.        Yet, the assumed probabilistic intensity models are quite simple        (namely Gaussian distributions). More sophisticated models for        intensity, color or texture of objects and background could be        employed.    -   The Bayesian aposteriori optimization is solved in a variational        setting by gradient descent rather than by stochastic sampling        techniques. While this limits the algorithms used by the        invention to only track the most likely hypothesis (rather than        multiple hypotheses), it facilitates an extension to        higher-dimensional shape representations without the drastic        increase in computational complexity inherent to sampling        methods.

Recently, Goldenberg et al. [Goldenberg, Kimmel, Rivlin, and Rudzsky,Pattern Recognition, 38:1033-1043, July 2005.] successfully applied PCAto an aligned shape sequence to classify the behavior of periodic shapemotion. Though this work is also focused on characterizing movingimplicitly represented shapes, it differs from the present invention inthat shapes are not represented by the level set embedding function (butrather by a binary mask), it does not make use of autoregressive models,and it is focused on behavior classification of pre-segmented shapesequences rather than segmentation or tracking with dynamical shapepriors.

The details of one or more embodiments of the invention are set forth inthe accompanying drawings and the description below. Other features,objects, and advantages of the invention will be apparent from thedescription and drawings, and from the claims.

DESCRIPTION OF DRAWINGS

FIG. 1 shows a low-dimensional approximation of a set of trainingsilhouettes, the silhouettes on the top are hand segmented and thebottom silhouettes are approximated by PCA of their embedding functionsin accordance with the invention;

FIG. 2 are autocorrelation functions used to validate the autoregressivemodel according to the invention, the autocorrelation functions ofresiduals associated are plotted with the first four shape modes;

FIG. 3 are shape modes, the original shape sequence (left) and thesequence synthesized by a statistically learned second order Markovchain (right) according to the invention showing the temporal evolutionof the first, second and sixth shape eigenmode.

FIG. 4 are walking sequence generated by the process according to theinvention, sample silhouettes being generated by a statistically learnedsecond order Markov model on the embedding functions;

FIG. 5 shows samples from an image sequence with increasing amounts ofnoise.

FIG. 6 shows segmentation according to the invention using a staticshape prior for 25% noise;

FIG. 7 shows segmentation according to the invention using a staticshape prior for 50% noise;

FIG. 8 shows segmentation according to the invention using a dynamicalshape prior for 50\% noise;

FIG. 9 shows tracking according to the invention with dynamical shapedprior for 75% of noise;

FIG. 10 shows tracking according to the invention with dynamical shapeprior for 90% of noise;

FIG. 11 shows quantitative evaluation of the segmentation accuracyaccording to the invention.

FIG. 12 shows tracking in the presence of occlusion according to theinvention;

FIG. 13 is a flow diagram of the process according to the invention; and

FIG. 14 are statistically generated embedding surfaces obtained bysampling from a second order autoregressive model according to theinvention, and the contours given by the zero level lines of thesynthesized surfaces. The implicit formulation allows the embeddedcontour to change topology (bottom left image).

Like reference symbols in the various drawings indicate like elements.

DETAILED DESCRIPTION

Referring now to FIG. 13, a flow diagram is shown of a method fordetection and tracking of deformable objects. Here, to illustrate anexample of the method, the object is a walking person. It should beunderstood, however, that the method applies to other deformable objectsincluding anatomical objects such as the beating human heart.

Before discussing the steps in FIG. 13, a Bayesian formulation for levelset based image sequence segmentation will be introduced. First, thegeneral formulation in the space of embedding functions will bediscussed and subsequently describe a computationally efficientformulation in a low-dimensional subspace.

2.1. General Formulation

In the following, a shape is defined as a set of closed 2D contoursmodulo a certain transformation group, the elements of which are denotedby T_(θ) with a parameter vector θ. Depending on the application, thesemay be rigid-body transformations, similarity or affine transformationsor larger transformation groups. The shape is represented implicitly byan embedding function φ according to equation (1). Thus objects ofinterest will be given by φ(T_(θ)x), where the transformation T_(θ) actson the grid, leading to corresponding transformations of the implicitlyrepresented contour. Here, the shape φ is purposely separated from thetransformation parameters θ since one may want to use different modelsto represent and learn their respective temporal evolution.

Assume there are given consecutive images I_(t): Ω→R from an imagesequence, where I_(1:t) denotes the set of images {I₁,I₁, . . . , I_(t)}at different time instances. Using the Bayesian formula (with allexpressions conditioned on I_(1:t−1)), the problem of segmenting thecurrent frame I_(t) can then be addressed by maximizing the conditionalprobability $\begin{matrix}{{{P\left( {\phi_{t},{\theta_{t}\text{❘}I_{1\text{:}t}}} \right)} = \frac{{P\left( {{I_{t}\text{❘}\phi_{t}},\theta_{t},I_{{1\text{:}t} - 1}} \right)}{P\left( {\phi_{t},{\theta_{t}\text{❘}I_{{1\text{:}t} - 1}}} \right)}}{P\left( {I_{t}\text{❘}I_{{1\text{:}t} - 1}} \right)}},} & (2)\end{matrix}$with respect to the embedding function φ_(t) and the transformationparameters θ_(t). (Modeling probability distributions oninfinite-dimensional spaces is in general an open problem, includingissues of defining appropriate measures and of integrability. Thereforethe functions φ may be thought of as finite-dimensional approximationsobtained by sampling the embedding functions on a regular grid). For thesake of brevity, the philosophical interpretation of the Bayesianapproach will not be discussed. Suffice it to say here that that theBayesian framework can be seen as an inversion of the image formationprocess in a probabilistic setting.

The denominator in equation (2) does not depend on the estimatedquantities and can therefore be neglected in the maximization. Moreover,the second term in the numerator can be rewritten using theChapman-Kolmogorov [A. Papoulis, Probability, Random Variables, andStochastic Processes, McGraw-Hill, New York, 1984] equation:P(φ_(t),θ_(t) |I_(1:t−1))=∫P(φ_(t),θ_(t)|φ_(1:t−1),θ_(1:t−1))P(φ_(1:t−1),θ_(1:t−1) |I_(1:t−1))dφ _(1:t−1) dθ _(1:t−1)  (3)

In the following, several assumptions are made which are aimed atsimplifying expression in equation (2), leading to a computationallymore feasible estimation problem:

-   -   It is assumed that the images I_(1:t) are mutually independent:        P(I _(t)|φ_(t),θ_(t) ,I _(1:t−1))=P(I _(t)|φ_(t),θ_(t)).  (4)    -   It is assumed that the intensities of the shape of interest and        of the background are independent samples from two Gaussian        distributions with unknown means μ₁,₂ and variances σ₁,σ₂. As a        consequence, the data term above can be written as:        ${{P\left( {{I_{t}❘\phi_{t}},\theta_{t}} \right)} = {{\prod\limits_{\substack{x \\ {\phi_{t}{({T_{\theta_{t}}x})}} \geq 0}}^{\quad}{\frac{1}{\sqrt{2\pi}\sigma_{1}}{\exp\left( {- \frac{\left( {{I_{t}(x)} - \mu_{1}} \right)^{2}}{2\sigma_{1}^{2}}} \right)}{\prod\limits_{\substack{x \\ {\phi_{t}{({T_{\theta_{t}}x})}} < 0}}^{\quad}{\frac{1}{\sqrt{2\pi}\sigma_{2}}{\exp\left( {- \frac{\left( {{I_{t}(x)} - \mu_{2}} \right)^{2}}{2\sigma_{2}^{2}}} \right)}}}}} \propto {\exp\left( {{- {\int_{\Omega}^{\quad}{\left( {\frac{\left( {I_{t} - \mu_{1}} \right)^{2}}{2\sigma_{1}^{2}} + {\log\quad\sigma_{1}}} \right)H\quad{\phi_{t}\left( {T_{\theta_{t}}x} \right)}}}} + {\left( {\frac{\left( {I_{t} - \mu_{2}} \right)^{2}}{2\sigma_{2}^{2}} + {\log\quad\sigma_{2}}} \right)\left( {1 - {H\quad{\phi_{t}\left( {T_{\theta_{t}}x} \right)}}} \right){\mathbb{d}x}}} \right)}}},$    -   where the Heaviside step function Hφ≡H(φ) is introduced to        denote the areas where φ is positive (Hφ=1) or negative (Hφ=0).        Related intensity models have been proposed among others, see D.        Mumford and J. Shah “Optimal approximations by piecewise smooth        functions and associated variational problems”: Comm. Pure Appl.        Math., 42:577-685, 1989, S. C. Zhu and A. Yuille, “Region        competition: Unifying snakes, region growing”, and Bayes, “MDL        for multiband image segmentation”: IEEE PAMI, 18(9):884-900,        1996, and T. F. Chan and L. A. Vese “Active contours without        edges”: IEEE Trans. Image Processing, 10(2):266-277, 2001. The        model parameters μ_(i) and σ_(i) are estimated jointly with the        shape φ_(t) and the transformation θ_(t). Their optimal values        are given by the means and variances of the intensity I_(t)        inside and outside the current shape: $\begin{matrix}        {{\mu_{1} = {\frac{1}{a_{1}}{\int{I_{t}H\quad\phi_{t}{\mathbb{d}x}}}}},{\sigma_{1}^{2} = {{\frac{1}{a_{1}}{\int{I_{t}^{2}H\quad\phi_{t}{\mathbb{d}x}}}} - \mu_{1}^{2}}},{{{where}\quad a_{1}} = {\int{H\quad\phi_{t}{\mathbb{d}x}}}},} & (5)        \end{matrix}$    -   and similarly for μ₂ and σ₂ with Hφ_(t) replaced by (1−Hφ_(t)).        To keep the notation simple, these parameters are not displayed        as part of the dynamic variables.    -   To avoid the computational burden of considering all possible        intermediate shapes φ_(1:t−1) and transformations θ_(1:t−1) in        equation (3), it is assumed the distributions of previous states        to be strongly peaked around the maxima of the respective        distributions:        P(φ_(1:t−1),θ_(1:t−1) |I _(1:t−1))≈δ(φ_(1:t−1)−{circumflex over        (φ)}_(1:t−1))δ(θ_(1:t−1)−{circumflex over (θ)}_(1:t−1)),  (6)    -   where ({circumflex over (φ)}_(i),{circumflex over        (θ)}_(i))=argmaxP(φ_(i),θ_(i)|I_(1:i-1)) are the estimates of        shape and transformation obtained for the past frames, and δ(·)        denotes the Dirac delta function. An alternative justification        for this approximation is the following: Assume that due to        memory limitations, the tracking system cannot store the        acquired images, but that it merely stores the past estimates of        shape and transformation. Then the inference problem at time t        reduces to that of maximizing the conditional distribution        P(φ_(t),θ_(t) |I _(t),{circumflex over (φ)}_(1:t−1),{circumflex        over (θ)}_(1:t−1))∝P(I        _(t)|φ_(t),θ_(t))P(φ_(t),θ_(t)|{circumflex over        (φ)}_(1:t−1),{circumflex over (θ)}_(1:t−1))  (7)    -   with respect to the embedding function φ_(t) and the        transformation parameters θ_(t). This is equivalent to the        original inference problem see equation (2) subject to the        approximation see equation (6).    -   A central contribution of this paper is to model the joint prior        on shape φ_(t) and transformation θ_(t) conditioned on previous        shapes and transformations. To this end, two approximations are        considered:

In a first step, it is assumes that shape and transformation aremutually independent, i.e.P(φ_(t),θ_(t)|φ_(1:t−1),θ_(1:t−1))=P(φ_(t)|φ_(1:t−1))P(θ_(t)|θ_(1:t−1)),and assume a uniform prior on the transformation parameters, i.e.P(θ_(t),θ_(1:t−1))=const. This is complimentary to the recent work ofRathi et al., see Y. Rathi, N. Vaswani, A. Tannenbaum, and A. Yezzi.Particle filtering for geometric active contours and application totracking deforming objects. In IEEE Int. Conf. on Comp. Vision and Patt.Recognition, 2005 who proposed a temporal model for these transformationparameters, while not imposing any specific model on the shape.

In a second step, the more general case of a joint distributionP(φ_(t),θ_(t)|φ_(1:t−1),θ_(1:t−1)) of shape and transformationparameters is considered, taking into account couplings between shapeand transformation. Experimental results demonstrate that this leads tosuperior performance in dealing with occlusions.

2.2 A Finite-Dimensional Formulation

When estimating the conditional probability P(φ_(t),θ_(t)|{circumflexover (φ)}_(1:t−1),{circumflex over (θ)}_(1:t−1)) in (7) from sampledata, one needs to revert to finite-dimensional approximations of theembedding function. It is well known that statistical models can beestimated more reliably if the dimensionality of the model and the dataare low. The Bayesian inference is then recast in a low-dimensionalformulation within the subspace spanned by the largest principaleigenmodes of a set of sample shapes. Training sequence is thenexploited in a twofold way: Firstly, it serves to define alow-dimensional subspace in which to perform estimation. And secondly,within this subspace the process uses it to learn dynamical models forimplicit shapes.

Let {φ₁, . . . ,φ_(N)} be a temporal sequence of training shapes. It isassumed that all training shapes φ_(i) are signed distance functions.Yet an arbitrary linear combination of eigenmodes will in general notgenerate a signed distance function. While the proposed statisticalshape models favor shapes which are close to the training shapes (andtherefore close to the set of signed distance functions), not all shapessampled in the considered subspace will correspond to signed distancefunctions. Let φ₀ denote the mean shape and ψ₁, . . . ,ψ_(n) the nlargest eigenmodes with n□ N. The process then approximates eachtraining shape as: $\begin{matrix}{{{\phi_{i}(x)} = {{\phi_{0}(x)} + {\sum\limits_{j = 1}^{n}{\alpha_{ij}{\psi_{j}(x)}}}}},{where}} & (8) \\{\alpha_{ij} = {\left( {{\phi_{i} - \phi_{0}},\psi_{j}} \right) \equiv {\int{\left( {\phi_{i} - \phi_{0}} \right)\psi_{j}{{\mathbb{d}x}.}}}}} & (9)\end{matrix}$

Such PCA based representations of level set functions have beensuccessfully applied for the construction of statistical shape priors insee M. Leventon, W. Grimson, and O. Faugeras. Statistical shapeinfluence in geodesic active contours. In CVPR, volume 1, pages 316-323,Hilton Head Island, S.C., 2000, A. Tsai, A. Yezzi, W. Wells, C. Tempany,D. Tucker, A. Fan, E. Grimson, and A. Willsky. Model-based curveevolution technique for image segmentation. In Comp. Vision Patt.Recog., pages 463-468, Kauai, Hawaii, 2001, M. Rousson, N. Paragios, andR. Deriche. Implicit active shape models for 3d segmentation in MRIimaging. In MICCAI, pages 209-216, 2004 and M. Rousson and D. Cremers(M. Rousson and D. Cremers. MICCAI, volume 1, pages 757-764, 2005.).

Efficient kernel density estimation of shape and intensity priors forlevel set segmentation. In MICCAI, 2005. In the following, the vector ofthe first n eigenmodes will be denoted as ψ=(ψ₁, . . . ,ψ_(n)). Eachsample shape φ_(i) is therefore approximated by the n-dimensional shapevector α_(i)=(α_(i1), . . . ,α_(in)). Similarly, an arbitrary shape φcan be approximated by a shape vector of the formα_(φ)=(φ−φ₀,ψ).  (10)FIG. 1 shows a set of silhouettes from a sequence of a walking personand their approximation by the first 6 eigenmodes. While thisapproximation is certainly a rough approximation lacking some of thedetails of the shape, it was found sufficiently. More particularly, thesequence of six silhouettes in the top half of FIG. 1 are from handtracings according to Step 100 of FIG. 13 and the six silhouettes in thelower half of FIG. 1 are PCA approximations as in Step 102 of FIG. 13.Thus, FIG. 1 shows a low-dimensional approximation of a set of trainingsilhouettes. The silhouettes (top of FIG. 1) hand segmented inaccordance with Step 100, FIG. 13, are the bottom silhouettes areapproximated by the first 6 principal components (PCA) of theirembedding functions (bottom of FIG. 1)—see equation (8)

In analogy to the derivation presented in the previous section, the goalof image sequence segmentation within this subspace can then be statedas follows: Given consecutive images I_(t): Ω→R from an image sequence,and given the segmentations α_(1:t−1) and transformations {circumflexover (θ)}_(1:t−1) obtained on the previous images I_(1:t−1) the processmaximizes the conditional probability $\begin{matrix}{{{P\left( {\alpha_{t},{\theta_{t}\text{❘}I_{t}},\alpha_{{1\text{:}t} - 1},{\hat{\theta}}_{{1\text{:}t} - 1}} \right)} = \frac{{P\left( {{I_{t}\text{❘}\alpha_{t}},\theta_{t}} \right)}{P\left( {\alpha_{t},{\theta_{t}\text{❘}\alpha_{{1\text{:}t} - 1}},{\hat{\theta}}_{{1\text{:}t} - 1}} \right)}}{P\left( {{I_{t}\text{❘}\alpha_{{1\text{:}t} - 1}},{\hat{\theta}}_{{1\text{:}t} - 1}} \right)}},} & (11)\end{matrix}$with respect to the shape parameters α_(t) and the transformationparameters θ_(t). The conditional probability is modeled as:P(α_(t),θ_(t)|α_(1:t−1),{circumflex over (θ)}_(1:t−1)),  (12)which constitutes the probability for observing a particular shape α_(t)and a particular transformation θ_(t) at time t, conditioned on theparameter estimates for shape and transformation obtained on previousimages.

3. Dynamical Statistical Shape Models

Abundant theory has been developed to model temporally correlated timeseries data. Applications of dynamical systems to model deformableshapes were proposed among others in [A. Blake and M. Isard. ActiveContours. Springer, London, 1998]. Here, the process learns dynamicalmodels for implicitly represented shapes. To simplify the discussion,let us first focus on dynamical models of shape deformation. In otherwords, it is assumed that a uniform distribution on the transformationparameters and merely model the conditional distributionP(α_(t)|α_(1:t−1)).

3.1 Dynamical Models of Deformation

Referring again to FIG. 13, in Step 100, a hand-segmented image sequenceof a deformable object having a sequentially changing, for example,oscillatory behavior is obtained in any conventional manner. Moreparticularly, as described above in section 2.2, let {φ₁, . . . ,φ_(N)}be a temporal sequence of training shapes. The result is shown for awalking person example in FIG. 1 showing black training shapes on whiteground and level set functions for each shape. As described in section2.2, let φ₀ denote the mean shape and ψ₁, . . . ,ψ_(n) the n largesteigenmodes with n□ N.

In Step 102, the process computes the principal component, using PCA)representation of level set functions, this is called the shape vectoras in equations (9) and (10) below. The PCA representation of thesequence of silhouettes is shown in the lower 6 training shapes here 6silhouettes of FIG. 1. Note that the last one can observe that the inthe last one of the 6 silhouettes in the upper set the right foot has asharper toe than in the corresponding last one of the silhouettes in thelast one of the lower 6 shapes because of the approximation in a PCA.

Thus, from the PCA, the process then approximate each training shape as:in equations (8) and (9) above.

In Step 104, the process estimates a dynamical (autoregressive) modelfor the sequence of shape vectors. This is shown in equation (13) below.FIG. 3 shows the shape vectors of the input sequence (left) and of thesequence synthesized with the model (right). More particularly, FIG. 3is a model comparison; the original shape sequence (top) and thesequence synthesized by a statistically learned second order Markovchain (bottom) exhibit similar oscillatory behavior and amplitudemodulation. The plots show the temporal evolution of the first, secondand sixth shape eigenmode.

FIG. 5 shows samples from an image sequence with increasing noise. FIG.5 are images from a sequence with increasing amounts of noise, here asample input frame from a sequence with 25%, 50%, and 90% noise, where90% noise means that 90% of all pixels were replaced by a randomintensity sampled from a uniform distribution develops a temporalstatistical shape model of the oscillatory behavior of the embeddingfunction representing the object, here the walking person, from priormotion.

Still more particularly, the process learns the temporal dynamics of adeforming shape by approximating the shape vectors α_(t)≡α_(φ) _(t) of asequence of level set functions by a Markov chain ([Neumaier, Schneider2001]) of order k, i.e.:α_(t) =μ+A ₁α_(t−1) +A ₂α_(t−2) + . . . +A _(k)α_(t−k)+η,  (13)where η is zero-mean Gaussian noise with covariance Σ. The probabilityof a shape conditioned on the shapes observed in previous time steps istherefore given by the corresponding autoregressive model of order k:$\begin{matrix}{{{P\left( {\alpha_{t}\text{❘}\alpha_{{1\text{:}t} - 1}} \right)} \propto {\exp\left( {{- \frac{1}{2}}v^{*}{\sum^{- 1}v}} \right)}},{where}} & (14) \\{v \equiv {\alpha_{t} - \mu - {A_{1}\alpha_{t - 1}} - {A_{2}\alpha_{t - 2}\ldots} - {A_{k}\alpha_{t - k}}}} & (15)\end{matrix}$

Various methods have been proposed in the literature to estimate themodel parameters given by the mean μεR^(n) and the transition and noisematrices A₁, . . . ,A_(k),ΣεR^(n×n). Here, the process applies astepwise least squares algorithm proposed in [A. Neumaier and T.Schneider. Estimation of parameters and eigenmodes of multivariateautoregressive models. ACM T. on Mathematical Software, 27(1):27-57,2001]. Different tests have been devised to quantify the accuracy of themodel fit. Two established criteria for model accuracy are Akaike'sFinal Prediction Error see H. Akaike. Autoregressive model fitting forcontrol. Ann. Inst. Statist. Math., 23:163-180, 1971 and Schwarz'sBayesian Criterion G. Schwarz. Estimating the dimension of a model. Ann.Statist., 6:461-464, 1978. Using dynamical models up to an order of 8,it was found that according to Schwarz's Bayesian Criterion, thetraining sequences using the process were best approximated by anautoregressive model of second order.

From a training sequence of 151 consecutive silhouettes, the parametersof a second order autoregressive model are estimated. This model wassubsequently evaluated by plotting the autocorrelation functions of theresiduals associated with each of the modeled eigenmodes—see FIG. 2.These show that the residuals are essentially uncorrelated. Thus, withthe autocorrelation functions shown in FIG, 2, one can validate theautoregressive model provided in Step 104 of FIG. 13, by plotting theautocorrelation functions of the residuals associated with the firstfour shapes modes. Clearly these residuals are statistically correlated.

In addition, the estimated model parameters allow the process tosynthesize a walking sequence according to equation (13). To remove thedependency on the initial conditions, the first several hundred sampleswere discarded. FIG. 3 shows the temporal evolution of the first, secondand sixth eigenmode in the input sequence (left) and in the synthesizedsequence (right). Clearly, the second order model captures some of thekey elements of the oscillatory behavior. The original shape sequence,left, and the sequence synthesized by a statistical learned second orderMarkov chain, right, in accordance with Step 104 (FIG. 13) exhibitsimilar oscillatory behavior and amplitude modulation. These plots showthe temporal evolution of the first, second and sixth shape eigenmode.The original shape sequence (left) and the sequence synthesized by astatistically learned second order Markov chain (right) exhibit similaroscillatory behavior and amplitude modulation. The plots show thetemporal evolution of the first, second and sixth shape eigenmode.

While the synthesized sequence does capture the characteristic motion ofa walking person, FIG. 4 shows that the individual synthesizedsilhouettes do not in all instances mimic valid shapes. It is believesthat such limitations can be expected from a model which stronglycompresses the represented input sequence: Instead of 151 shapes definedon a 256×256 grid, the model merely retains a mean shape φ₀, 6eigenmodes ψ and the autoregressive model parameters given by a6-dimensional mean and three 6×6 matrices. This amounts to 458851instead of 9895936 parameters, corresponding to a compression to 4.6% ofthe original size. While the synthesis of dynamical shape models usingautoregressive models has been studied before [A. Blake and M. Isard.Active Contours, Springer, London, 1998], it should be noted that thesynthesizing of the shapes is based on an implicit representation. Moreparticularly, FIG. 4 shows synthetically generated walking sequenceproduced by the process. The sample silhouettes are generated by thestatistically learned second order Markov model on the embeddingfunctions-see equation (13). While the Markov model captures much of thetypical oscillatory behavior of a walking person, not all generatedsimples corresponds to permissible shapes—cf. the last two silhouetteson the bottom right. Yet as described in section 5 bellow the model issufficiently accurate to appropriately constrain a segment process. FIG.4 show synthetically generated walking sequence by the process. Samplesilhouettes generated by a statistically learned second order Markovmodel on the embedding functions—see equation (13). While the Markovmodel captures much of the typical oscillatory behavior of a walkingperson, not all generated samples correspond to permissible shapes—cf.the last two silhouettes on the bottom right. Yet, as will be describedin Section 5, the model is sufficiently accurate to appropriatelyconstrain a segmentation process.

Referring to FIG. 14, a sequence of statistically synthesized embeddingfunctions is shown and the induced contours given by the zero level lineof the respective surfaces are also shown. In particular, this implicitrepresentation allows to synthesize shapes of varying topology. Thesilhouette on the bottom left of FIG. 14 for example, consists of twocontours. The sequence are statistically generated embedding surfacesobtained by sampling from a second order autoregressive model, and thecontours given by the zero level lines of the synthesized surfaces. Theimplicit formulation allows the embedded contour to change topology(bottom left image).

3.2 Joint Dynamics of Deformation and Transformation

In the previous section, autoregressive models were introduced tocapture the temporal dynamics of implicitly represented shapes. To thisend, the degrees of freedom corresponding to transformations such astranslation and rotation were removed before performing the learning ofdynamical models. As a consequence, the learning only incorporatesdeformation modes, neglecting all information about pose and location.The synthesized shapes in FIG. 4, for example, show a walking personwhich is walking “on the spot”.

In general, one can expect the deformation parameters α_(t) and thetransformation parameters θ_(t) to be tightly coupled. A model whichcaptures the joint dynamics of shape and transformation would clearly bemore powerful than one which neglects the transformations. Yet, theprocess learns dynamical shape models which are invariant totranslation, rotation and other transformations. To this end, one canmake use of the fact that the transformations form a group which impliesthat the transformation θ_(t) at time t can be obtained from theprevious transformation θ_(t−1) by applying an incrementaltransformation □θ_(t): T_(θ) _(t) x=T_(□θ) _(t) T_(θ) _(t−1) x. Insteadof learning models of the absolute transformation θ_(t), the processsimply learns models of the update transformations □θ_(t) (e.g. thechanges in translation and rotation). By construction, such models areinvariant with respect to the global pose or location of the modeledshape.

To jointly model transformation and deformation, the process simplyobtains for each training shape in the learning sequence the deformationparameters α_(i) and the transformation changes □θ_(i), and fit theautoregressive models given in equations (14) and (15) to the combinedvector $\begin{matrix}{\alpha_{t} = {\begin{pmatrix}\alpha_{t} \\{\bullet\theta}_{t}\end{pmatrix}.}} & (16)\end{matrix}$

In the case of the walking person, it was found that—as in thestationary case—a second order autoregressive model gives the best modelfit. Synthesizing from this model allows to generate silhouettes of awalking person which are similar to the ones shown in FIG. 4, but whichmove forward in space, starting from an arbitrary (user-specified)initial position.

4. Dynamical Shape Priors in Variational Segmentation

Given an image I_(t) from an image sequence and given a set ofpreviously segmented shapes with shape parameters α_(1:t−1) andtransformation parameters θ_(1:t−1), the goal of tracking is to maximizethe conditional probability equation (11) with respect to shape α_(t)and transformation θ_(t). This can be performed by minimizing itsnegative logarithm, which is—up to a constant—given by an energy of theform:E(α_(t),θ_(t))=E _(data)(α_(t),θ_(t))+νE _(shape)(α_(t),θ_(t)).  (17

The additional weight ν was introduced to allow a relative weightingbetween prior and data term. In particular if the intensity informationis not consistent with the assumptions (Gaussian intensity distributionsof object and background), a larger weight of ν is preferable. The dataterm is given by: $\begin{matrix}{{{E_{data}\left( {\alpha_{t},\theta_{t}} \right)} = {{\int\limits_{\Omega}{\left( {\frac{\left( {I_{t} - \mu_{1}} \right)^{2}}{2\sigma_{1}^{2}} + {\log\quad\sigma_{1}}} \right)H\quad\phi_{\alpha_{t},\theta_{t}}}} + {\left( {\frac{\left( {I_{t} - \mu_{2}} \right)^{2}}{2\sigma_{2}^{2}} + {\log\quad\sigma_{2}}} \right)\left( {1 - {H\quad\phi_{\alpha_{t},\theta_{t}}}} \right){\mathbb{d}x}}}},} & \left( 18 \right.\end{matrix}$where, for notational simplicity, the expression below is introducedφ_(α) _(t) _(,θ) _(t) ≡φ₀(T _(θ) _(t) x)+α_(t) ^(•)ψ(T _(θ) _(t)x),  (19

to denote the embedding function of a shape generated with deformationparameters α_(t) and transformed with parameters θ_(t).

Using the autoregressive model equation (14), the shape energy is givenby: $\begin{matrix}{{E_{shape}\left( {\alpha_{t},\theta_{t}} \right)} = {\frac{1}{2}v^{*}{\sum^{- 1}v}}} & \left( 20 \right.\end{matrix}$with ν defined in equation (15). To incorporate the joint model ofdeformation and transformation introduced in Section 1, the aboveexpression for ν needs to be enhanced by the relative transformations□θ: $\begin{matrix}{{v \equiv {\begin{pmatrix}\alpha_{t} \\{\bullet\theta}_{t}\end{pmatrix} - \mu - {A_{1}\begin{pmatrix}\alpha_{t - 1} \\{\bullet{\hat{\theta}}_{t - 1}}\end{pmatrix}} - {{A_{2}\begin{pmatrix}\alpha_{t - 2} \\{\bullet{\hat{\theta}}_{t - 2}}\end{pmatrix}}\ldots} - {A_{k}\begin{pmatrix}\alpha_{t - k} \\{\bullet{\hat{\theta}}_{t - k}}\end{pmatrix}}}},} & (21)\end{matrix}$where μ and A_(i) denote the statistically learned mean and transitionmatrices for the joint space of deformations and transformations, and kis the model order. In experiments, a model order of k=2 was selected.

One can easily show that a second order autoregressive model can beinterpreted as a stochastic version of a time-discrete damped harmonicoscillator. As a consequence, it is well suited to model essentiallyoscillatory shape deformations. However, it was found that higher-orderautoregressive models provide qualitatively similar results.

Compute optimal segmentation of test sequence which is consistent withthe learned dynamical model. This is done by finding the shape vectorwhich maximizes the conditional probability in equation (11).Maximization is implemented by performing gradient descent on thenegative logarithm of this probability. This is shown in equation (22).Intuitively this process deforms the shape such that it matches best,both with the current image and with the prediction of the dynamicalmodel. The optimal shape for each test image is shown in FIGS. 9, 10,11, and 13.

Tracking an object of interest over a sequence of images I_(1:t) with adynamical shape prior can be done by minimizing energy equation (17). Agradient descent strategy was used, leading to the followingdifferential equations to estimate the shape vector α_(t):$\begin{matrix}{\frac{\mathbb{d}{\alpha_{t}(\tau)}}{\mathbb{d}\tau} = {{- \frac{\partial{E_{data}\left( {\alpha_{t},\theta_{t}} \right)}}{\partial\alpha_{t}}} - {v\frac{\partial{E_{shape}\left( {\alpha_{t},\theta_{t}} \right)}}{\partial\alpha_{t}}}}} & \left( 22 \right.\end{matrix}$where τ denotes the artificial evolution time, as opposed to thephysical time t. The data term is given by:${\frac{\partial E_{data}}{\partial\alpha_{t}} = \left\langle {\psi,{{\delta\left( {\phi\alpha}_{t} \right)}\left( {\frac{\left( {I_{t} - \mu_{1}} \right)^{2}}{2\sigma_{1}^{2}} - \frac{\left( {I_{t} - \mu_{2}} \right)^{2}}{2\sigma_{2}^{2}} + {\log\frac{\sigma_{1}}{\sigma_{2}}}} \right)}} \right\rangle},$and the shape term is given by: $\begin{matrix}{{\frac{\partial E_{shape}}{\partial\alpha_{t}} = {{\frac{\partial v}{\partial\alpha_{t}}\Sigma^{- 1}v} = {\begin{pmatrix}1_{n} & 0 \\0 & 0\end{pmatrix}\Sigma^{- 1}v}}},} & (23)\end{matrix}$with ν given in equation (21) and 1_(n) being the n-dimensional unitmatrix modeling the projection on the shape components of ν, where n isthe number of shape modes. These two terms affect the shape evolution inthe following manner: The first term draws the shape to separate theimage intensities according to the two Gaussian intensity models. Sincevariations in the shape vector α_(t) affect the shape through theeigenmodes ψ, the data term is a projection onto these eigenmodes. Thesecond term induces a relaxation of the shape vector α_(t) toward themost likely shape, as predicted by the dynamical model based on theshape vectors and transformation parameters obtained for previous timeframes.

Similarly, minimization with respect to the transformation parametersθ_(t) is obtained by evolving the respective gradient descent equationgiven by: $\begin{matrix}{\frac{\mathbb{d}{\theta_{t}(\tau)}}{\mathbb{d}\tau} = {{- \frac{\partial{E_{data}\left( {\alpha_{t},\theta_{t}} \right)}}{\partial\theta_{t}}} - {v\frac{\partial{E_{shape}\left( {\alpha_{t},\theta_{t}} \right)}}{\partial\theta_{t}}}}} & \left( 24 \right.\end{matrix}$where the data term is given by $\begin{matrix}{{\frac{\partial{E_{data}\left( {\alpha_{t},\theta_{t}} \right)}}{\partial\theta_{t}} = \left\langle {{{\nabla\psi}\frac{\mathbb{d}\left( {T_{\theta_{t}}x} \right)}{\mathbb{d}\theta_{t}}},{{\delta\left( {\phi_{\alpha}}_{t} \right)}\left( {\frac{\left( {I_{t} - \mu_{1}} \right)^{2}}{2\sigma_{1}^{2}} - \frac{\left( {I_{t} - \mu_{2}} \right)^{2}}{2\sigma_{2}^{2}} + {\log\frac{\sigma_{1}}{\sigma_{2}}}} \right)}} \right\rangle},} & (25)\end{matrix}$and the driving term from the prior is given by: $\begin{matrix}{{\frac{\partial E_{shape}}{\partial\theta_{t}} = {{\frac{\partial v}{\partial\theta_{t}}\Sigma^{- 1}v} = {\frac{\mathbb{d}\left( {\bullet\theta}_{t} \right)}{\mathbb{d}\theta_{t}}\begin{pmatrix}0 & 0 \\0 & 1_{s}\end{pmatrix}\Sigma^{- 1}v}}},} & \left( 26 \right.\end{matrix}$where, as above, the shape prior contributes a driving force toward themost likely transformation predicted by the dynamical model. The blockdiagonal matrix in equation (11) simply models the projection onto the stransformation components of the joint vector ν defined in equation(21).

5. Experimental Results 5.1. Dynamical Versus Static Statistical ShapePriors

In the following, dynamical statistical shape prior introduced above forthe purpose of level set based tracking will be introduced.

To construct the shape prior, in accordance with the process, and asnoted in Step 100 of FIG. 13, hand-segmentation of, in this example, asequence of a walking person, centered and binarized each shape isobtained. Subsequently, the process determines the set of signeddistance functions {φ_(i)}_(i=1 . . . N) associated with each shape andcomputed the dominant 6 eigenmodes, Step 102 in FIG. 13. Projecting eachtraining shape onto these eigenmodes, the process obtains in Step 104, asequence of shape vectors {α_(i)εR⁶}_(i=1 . . . N). The process fits asecond order multivariate autoregressive model to this sequence bycomputing the mean vector μ, the transition matrices A₁,A₂εR^(6×6) andthe noise covariance ΣεR^(6×6) shown in equation (14). Subsequently, theprocess compares segmentations of noisy sequences obtained bysegmentation in the 6-dimensional subspace without and with thedynamical statistical shape prior.

The segmentation without dynamical prior corresponds to that obtainedwith a static uniform prior in the subspace of the first few eigenmodes,as proposed in A. Tsai, A. Yezzi, W. Wells, C. Tempany, D. Tucker, A.Fan, E. Grimson, and A. Willsky. Model-based curve evolution techniquefor image segmentation. In Comp. Vision Patt. Recog., pages 463-468,Kauai, Hawaii, 2001. While there exist alternative models for staticshape priors (for example the Gaussian model M. Leventon, W. Grimson,and O. Faugeras. Statistical shape influence in geodesic activecontours. In CVPR, volume 1, pages 316-323, Hilton Head Island, S.C.,2000 or non-parametric statistical models D. Cremers, S. J. Osher, andS. Soatto. Kernel density estimation and intrinsic alignment forknowledge-driven segmentation: Teaching level sets to walk. In PatternRecognition, volume 3175 of LNCS, pages 36-44. Springer, 2004 and M.Rousson and D. Cremers [MICCAI, volume 1, pages 757-764], it was foundin experiments that all of these exhibit a qualitatively similarlimitation when applied to image sequence segmentation (see FIG. 7) theytend to get stuck in local minima because they do not exploit temporalshape correlations.

In Step 108, the process then applies the model against future,sequential motion of the object in the presence of unwanted phenomena bymaximizing the probability that the developed statistical shape modelmatches the sequential motion of the object in the presence of unwantedphenomena.

Thus, referring to FIG. 5, a sample input frame from a sequence with25%, 50%, 75% and 90% noise is shown. (It is noted that noise means that25% of all pixels were replaced by a random intensity sampled from auniform distribution. It is interesting to note that our algorithmeasily handles uniform noise, although its probabilistic formulation isbased on the assumption of Gaussian noise.) FIG. 6 shows a set ofsegmentations obtained with a uniform static shape prior on a sequencewith 25% noise. FIG. 6 shows a Segmentation using a static shape priorfor 25% noise. Constraining the level set evolution to a low-dimensionalsubspace allows to cope with a certain amount of noise. While thissegmentation without dynamical prior is successful in the presence ofmoderate noise, FIG. 7 shows that the segmentation without dynamicalprior eventually breaks down when the noise level is increased. FIG. 7shows segmentation using a static shape prior for 50\% noise. Using astatic (uniform) shape prior, the segmentation scheme cannot cope withlarger amounts of noise. It gets stuck in a local minimum after thefirst few frames. Since static shape priors do not provide forpredictions in time, they have a tendency of getting stuck to the shapeestimate obtained on the previous image. More particularly, FIG. 6 showssegmentations with a static shape prior on a walking sequence with 25%noise. Constraining the level set evolution to a low-dimensionalsubspace allows to cope with a certain amount of noise and FIG. 7 showssegmentations with a static shape prior on a walking sequence with 50%noise. Using merely a static shape prior, the segmentation scheme cannotcope with larger amounts of noise.

FIG. 8 shows segmentations of the same sequence as in FIG. 7 obtainedwith a dynamical statistical shape prior derived from a second orderautoregressive model. FIG. 8 shows segmentation using a dynamical shapeprior for 50% noise. In contrast to the segmentation with a static priorshown in FIG. 7, the dynamical prior (using a second-orderautoregressive model) imposes statistically learned information aboutthe temporal dynamics of the shape evolution to cope with missing ormisleading low-level information.

FIGS. 9 and 10 show that the statistical shape prior provides for goodsegmentations, even with 90% noise. Clearly, exploiting the temporalstatistics of dynamical shapes allows to make the segmentation processvery robust to missing and misleading information. More particularly,FIG. 8 shows segmentation using a dynamical statistical shape priorbased on a second order autoregressive model. In contrast to thesegmentation in FIG. 7, the prior imposes statistically learnedinformation about the temporal dynamics of the shape evolution to copewith misleading low-level information and FIG. 9 shows tracking withdynamical statistical shape prior to cope with larger amounts of noise.The input images were corrupted with 90% of noise. Yet, thestatistically learned dynamical shape model allows to disambiguate thelow-level information. These experiments confirm that the trackingschemes can indeed compete with the capacities of human observers. FIG.9 shows tracking with dynamical shaped prior for 75% of noise. Thestatistical learned dynamical shaped model allows to disambiguate thelow level information. FIG. 10 shows tracking with dynamical shape priorfor 90\% of noise. Quantitative evaluation with respect to the groundtruth, shown in FIG. 11, left side, indicates that our tracking schemecan indeed compete with the capacities of human observers, providingreliable segmentations where human observers fail. The segmentationresults on the first few frames are inaccurate, since the segmentationprocess with a dynamical shape prior accumulates image information (andmemory) over time.

5.2. Quantitative Evaluation of the Noise Robustness

In order to quantify the accuracy of segmentation, hand segmentations ofthe original test sequence were used. Subsequently, the following errormeasure was defined: $\begin{matrix}{{ɛ = \frac{\int{\left( {{H\quad{\phi(x)}} - {H\quad{\phi_{0}(x)}}} \right)^{2}{\mathbb{d}x}}}{{\int{H\quad{\phi(x)}{\mathbb{d}x}}} + {\int{H\quad{\phi_{0}(x)}{\mathbb{d}x}}}}},} & (27)\end{matrix}$where H is again the Heaviside step function, φ₀ is the truesegmentation and φ₀ the estimated one. This error corresponds to therelative area of the set symmetric difference, i.e. the union of bothsegmentations minus its intersection, divided by the areas of eachsegmentation. While there exist numerous measures of segmentationaccuracy, it was decided for this measure, because it takes on valueswithin the well-defined range 0≦ε≦1, where ε=0 corresponds to theperfect segmentation.

FIG. 11, left side, shows the segmentation error averaged over a testsequence as a function of the noise level. A dynamical shape prior ofdeformation and transformation (Section 3.3) were used, initializing thesegmentation process with an estimate of the initial location. The plotshows several things: Firstly, the error remains fairly constant fornoise levels below 60%. This is due to the fact that the weight ν of theprior is set to a fixed value for all experiments (ideally one would usesmaller weights for less noise). Therefore, the residual error of around5% stems from the discrepancy between the estimated dynamical model andthe true sequence, accumulating errors introduced by the principalcomponent approximation and the approximation by autoregressive models.Secondly, as can be expected, the error increases for larger values ofnoise. The deviation from monotonicity (in particular at 90% noise) isprobably an effect of statistical fluctuation. The initial pose estimatein conjunction with prior knowledge about the translational component ofwalking leads to the fact that the error is below that of a randomsegmentation even in the case of 100% noise. FIG. 11 shows quantitativeevaluation of the segmentation accuracy. The relative segmentation erroris plotted for increasing amounts of noise (left) and for varyingwalking speed (right). Even for 100% noise the segmentation errorremains well below 1 because the process integrates a good estimate ofthe initial position and a model of the translational motion. The ploton the right shows that for walking speeds ν slower than the learned onev₀, the segmentation error (with 70% noise) remains low, whereas forfaster walking sequences, the accuracy slowly degrades. Yet, even forsequences of 5 times the learned walking speed, segmentation with adynamical prior outperforms the segmentation with a static prior.

5.3. Robustness to Variations in Frequency and Frame Rate

The dynamical shape prior introduces a prior knowledge about how likelycertain silhouettes are, given the segmentations obtained on the lastfew frames. Let us assume that the process has learned a dynamical modelof a walking person from a sequence of a fixed walking speed ν₀.Clearly, the estimated model will be tuned to this specific walkingspeed. Yet, when applying such a model in practice, one cannot guaranteethat the person in the test sequence will be walking at exactly the samespeed. Equivalently, one may not be sure—even if the walking speed isidentical—that the camera frame rate is the same. In order to bepractically useful, the proposed prior must be robust to variations inthe walking frequency and frame rate.

To validate this robustness, test sequences of different walking speedby either dropping certain frames were synthesized (in order to speed upthe gait) or by replicating frames (thereby slowing down the gait). FIG.11, right side, shows the segmentation error ε, defined in equation(27), averaged over test sequences with 70% noise and speeds which varyfrom ⅕ the speed of the training sequence to 5 times the original speed.While the accuracy is not affected by slowing down the sequence, itdegrades gradually once the speed is increased. Yet, the segmentationprocess is quite robust to such drastic changes in speed. The reason forthis robustness is twofold: Firstly, the Bayesian formulation allows tocombine model prediction and input data in a way that the segmentationprocess constantly adapts to the incoming input data. Secondly, theautoregressive model only relies on the last few estimated silhouettesto generate a shape probability for the current frame. It does notassume long range temporal consistency and can thus handle sequenceswith varying walking speed. The experiments show that even for sequencesof 5 times the original walking sequence segmentation with a dynamicalmodel is superior to segmentation with a static model. This is notsurprising: In contrast to the static model, the dynamical model doesprovide a prediction of the temporal shape evolution. Even if thisprediction may be sub-optimal for strongly differing walking speeds, itstill allows to enhance the segmentation process.

5.4. Joint Dynamics of Deformation and Transformation

In Section 3.2, dynamical models were introduced to capture the jointevolution of deformation and transformation parameters. On the tasksshown so far, pure deformation models and joint models of deformationand transformation were found to provide similar segmentation results.While the joint model provides a prior about the transformationparameters which are most likely at a given time instance, thepure-deformation model requires these parameters to be estimated solelyfrom the data.

As a final example, a segmentation task is generated where thetransformation parameters cannot be reliably estimated from the data dueto a prominent occlusion. The test sequence shows a person walking fromright to left and an occluding bar moving from left to right, corruptedby 80% noise. FIG. 12, top row, shows segmentations obtained with adynamical shape prior capturing both deformation and transformation.Even when the walking silhouette is completely occluded, the model iscapable of generating silhouettes walking to the left and adapts to theimage data, once the figure reappears.

The bottom row of FIG. 12, on the other hand, shows the segmentation ofthe same frames with a dynamical model which only incorporates the shapedeformation. Since no knowledge about translation is assumed, thesegmentation process needs to rely entirely on the image information inorder to estimate the transformation parameters. As a consequence, thesegmentation process is mislead by the prominent occlusion. When thefigure reappears from behind the bar, the process integratescontradictory information about translation provided by the personwalking to the left and by the bar moving to the right. Once the figureof interest is lost, the prior simply “hallucinates” silhouettes of aperson walking “on the spot”—see the last image on the bottom right.Although a “failed” experiment, it is believed that this resultilluminates best how the dynamical model and the image information arefused within the Bayesian formulation for image sequence segmentation.FIG. 12 shows tracking in the presence of occlusion. The input sequenceshows a person walking to the left occluded by a bar moving to theright. While the top row is generated with a dynamical prior integratingboth deformation and transformation, the bottom row uses a dynamicalprior which merely captures the deformation component. Since the latterdoes not provide predictions of the translational motion, the estimationof translation is purely based on the image data. It is mislead by theocclusion and cannot recover, once the person reappears from behind thebar.

6. Conclusion

With the process described above, dynamical statistical shape models areused for implicitly represented shapes. In contrast to existing shapemodels for implicit shapes, these models capture the temporalcorrelations which characterize deforming shapes such as the consecutivesilhouettes of a walking person. Such dynamical shape models account forthe fact that the probability of observing a particular shape at a giventime instance may depend on the shapes observed at previous timeinstances.

For the construction of statistical shape models, the process extendedthe concepts of Markov chains and autoregressive models to the domain ofimplicitly represented shapes. The resulting dynamical shape modelstherefore allow to handle shapes of varying topology. Moreover, they areeasily extended to higher-dimensional shapes (i.e. surfaces).

The estimated dynamical models allow to synthesize shape sequences ofarbitrary length. For the case of a walking person, the accuracy of theestimated dynamical models was validated, comparing the dynamical shapeevolution of the input sequence to that of synthesized sequences forvarious shape eigenmodes, and verifying that the residuals arestatistically uncorrelated. Although the synthesized shapes do not inall instances correspond to valid shapes, one can nevertheless use thedynamical model to constrain a segmentation and tracking process in sucha way that it favors familiar shape evolutions.

To this end, a Bayesian formulation was developed for level set basedimage sequence segmentation, which allows to impose the statisticallylearned dynamical models as a shape prior for segmentation processes. Incontrast to most existing approaches to tracking, autoregressive modelsare integrated as statistical priors in a variational approach which canbe minimized by local gradient descent (rather than by stochasticoptimization methods).

Experimental results confirm that the dynamical shape priors outperformstatic shape priors when tracking a walking person in the presence oflarge amounts of noise. Quantitative evaluation of the segmentationaccuracy as a function of noise as provided. Moreover, the model-basedsegmentation process was found to be quite robust to large (up to afactor of 5) variations in frame rate and walking speed. Further, adynamical prior in the joint space of deformation and transformation wasshown to outperform a purely deformation-based prior, when tracking awalking person through prominent occlusions.

A number of embodiments of the invention have been described.Nevertheless, it will be understood that various modifications may bemade without departing from the spirit and scope of the invention.Accordingly, other embodiments are within the scope of the followingclaims.

1. A method for detecting and tracking a deformable object having asequentially changing behavior, comprising: developing a temporalstatistical shape model of the sequentially changing behavior of theembedding function representing the object from prior motion; and thenapplying the model against future, sequential motion of the object inthe presence of unwanted phenomena by maximizing the probability thatthe developed statistical shape model matches the sequential motion ofthe object in the presence of unwanted phenomena.
 2. A method,comprising: generating a dynamical model of the time-evolution of theembedding function of prior observations of a boundary shape of anobject, such object having an observable, sequentially changing boundaryshape; and subsequently using such model for probabilistic inferenceabout such shape of the object in the future.
 3. A method for detectingand tracking a deformable object having a sequentially changingbehavior, comprising: developing a temporal statistical shape model ofthe sequentially changing behavior of the embedding functionrepresenting the object from prior motion, comprising: developing ahand-segmented image sequence showing training shapes of the object andlevel set functions for each one of the shapes; computing a sequence ofshape vectors comprising principal component representations of thelevel set functions; estimating a dynamical model of for the sequence ofshape vectors; and determining the one of the sequence of shape vectorshaving the maximum probabilty of matching the dynamical model.
 4. Themethod recited in claim 3 wherein the estimating of the dynamical modelcomprises using an autoregressive model for the sequence of shapevectors.
 5. The method recited in claim 3 wherein the maximumprobability determining comprises maximizing a conditional probability.6. The method recited in claim 3 wherein the maximum probabilitydetermining comprises performing gradient descent on the negativelogarithm of the conditional probability.
 7. The method recited in claim3 wherein the estimating of the dynamical model comprises approximatingthe shape vectors α_(t)≡αa_(φ) _(t) of the sequence of level setfunctions by a Markov chain of order k, in accordance withα_(t)=μ+A₁α_(t−1)+A₂α_(t−2)+ . . . +A_(k)α_(t−k)+η, where η is zero-meanGaussian noise with covariance Σ.
 8. The method recited in claim 6wherein given consecutive images I_(t): Ω→R from an image sequence, andgiven the segmentations α_(1:t−1) and transformations {circumflex over(θ)}_(1:t−1) obtained on the previous images I_(1:t−1) the maximumprobability determining comprises: maximizes a conditional probability$\begin{matrix}{{{P\left( {\alpha_{t},\left. \theta_{t} \middle| I_{t} \right.,\alpha_{1:{t - 1}},{\hat{\theta}}_{1:{t - 1}}} \right)} = \frac{{P\left( {\left. I_{t} \middle| \alpha_{t} \right.,\theta_{t}} \right)}{P\left( {\alpha_{t},\left. \theta_{t} \middle| \alpha_{1:{t - 1}} \right.,{\hat{\theta}}_{1:{t - 1}}} \right)}}{P\left( {\left. I_{t} \middle| \alpha_{1:{t - 1}} \right.,{\hat{\theta}}_{1:{t - 1}}} \right)}},} & (11)\end{matrix}$ with respect to the shape vectors α_(t) and transformationparameters θ_(t); wherein the conditional probability is modeled as:P(α_(t),θ_(t)|α_(1:t−1),{circumflex over (θ)}_(1:t−1)),  (12)  whichconstitutes the probability for observing a particular shape α_(t) and aparticular transformation θ_(t) at time t, conditioned on the parameterestimates for shape and transformation obtained on previous images. 9.The method recited in claim 6 wherein the performing of the gradientdescent on the negative logarithm of the conditional probabilitycomprises using a gradient descent strategy, leading to the followingdifferential equations to estimate the shape vector α_(t):$\begin{matrix}{\frac{\mathbb{d}{\alpha_{t}(\tau)}}{\mathbb{d}\tau} = {{- \frac{\partial{E_{data}\left( {\alpha_{t},\theta_{t}} \right)}}{\partial\alpha_{t}}} - {v\frac{\partial{E_{shape}\left( {\alpha_{t},\theta_{t}} \right)}}{\partial\alpha_{t}}}}} & \left( 22 \right.\end{matrix}$ where τ denotes the artificial evolution time, as opposedto the physical time t.